5420 Anomaly Detection | Assignment 5 - Joyce Ng (jn2901)¶

Credit Card Fraud Detection Modeling¶

The Office of Management and Enterprise Services in the State of Oklahoma has made its purchase credit card transactions available. This dataset contains information on purchases made through the purchase card programs administered by the state and higher education institutions. This analysis will utilized Histogram Based Outlier Score and Empirical Cumulative Distribution-based Outlier Detection to identify anomalies.

Table of Contents¶

  • Section 1: Data Preparation
    • 1.1 Load Libraries and Dataset
    • 1.2 Check Missing Values & Change Columns Names and Data Type
    • 1.3 Create Data Fields
  • Section 2: EDA
    • 2.1 Distribution of Variables
  • Section 3: Feature Engineering
    • 3.1 Feature Benchmarks
    • 3.2 Feature Creation
    • 3.3 Descriptive Statistics of Features
    • 3.3 Distribution of Features
  • Section 4: Histogram-Based Outlier Score (HBOS)
  • Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD)
  • Section 6: Models Predictions Comparison
  • Section 7: Conclusion

 

Section 1: Data Preparation ¶

1.1 Load Libraries and Dataset ¶

In [1]:
# Load libraries
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.utils.data import generate_data
from pyod.models.hbos import HBOS
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.models.ecod import ECOD

# Visiualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns

import warnings 
warnings.filterwarnings("ignore")
In [2]:
df = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/2 - Credit Card/purchase_credit_card.csv')
df.head()
Out[2]:
Year-Month Agency Number Agency Name Cardholder Last Name Cardholder First Initial Description Amount Vendor Transaction Date Posted Date Merchant Category Code (MCC)
0 201307 1000 OKLAHOMA STATE UNIVERSITY Mason C GENERAL PURCHASE 890.00 NACAS 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
1 201307 1000 OKLAHOMA STATE UNIVERSITY Mason C ROOM CHARGES 368.96 SHERATON HOTEL 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM SHERATON
2 201307 1000 OKLAHOMA STATE UNIVERSITY Massey J GENERAL PURCHASE 165.82 SEARS.COM 9300 07/29/2013 12:00:00 AM 07/31/2013 12:00:00 AM DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE...
3 201307 1000 OKLAHOMA STATE UNIVERSITY Massey T GENERAL PURCHASE 96.39 WAL-MART #0137 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM GROCERY STORES,AND SUPERMARKETS
4 201307 1000 OKLAHOMA STATE UNIVERSITY Mauro-Herrera M HAMMERMILL COPY PLUS COPY EA 125.96 STAPLES DIRECT 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT...
In [3]:
# Check dimensions of the dataset, we found it has 442,458 rows and 11 columns 
print(df.shape, "\n") # Check the shape of df
print(df.columns, "\n") # Check column names
print(df.info(), "\n") # Check info of the df
df.describe() # Get the Simple Summary Statistics
(442458, 11) 

Index(['Year-Month', 'Agency Number', 'Agency Name', 'Cardholder Last Name',
       'Cardholder First Initial', 'Description', 'Amount', 'Vendor',
       'Transaction Date', 'Posted Date', 'Merchant Category Code (MCC)'],
      dtype='object') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Year-Month                    442458 non-null  int64  
 1   Agency Number                 442458 non-null  int64  
 2   Agency Name                   442458 non-null  object 
 3   Cardholder Last Name          442458 non-null  object 
 4   Cardholder First Initial      442458 non-null  object 
 5   Description                   442458 non-null  object 
 6   Amount                        442458 non-null  float64
 7   Vendor                        442458 non-null  object 
 8   Transaction Date              442458 non-null  object 
 9   Posted Date                   442458 non-null  object 
 10  Merchant Category Code (MCC)  442458 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 37.1+ MB
None 

Out[3]:
Year-Month Agency Number Amount
count 442458.000000 442458.000000 4.424580e+05
mean 201357.284375 42785.860353 4.249912e+02
std 47.107417 33378.461293 5.266509e+03
min 201307.000000 1000.000000 -4.286304e+04
25% 201309.000000 1000.000000 3.091000e+01
50% 201401.000000 47700.000000 1.048900e+02
75% 201404.000000 76000.000000 3.450000e+02
max 201406.000000 98000.000000 1.903858e+06

1.2: Check Missing Values & Change Columns Names and Data Type ¶

In [4]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values
Out[4]:
Year-Month                      0
Agency Number                   0
Agency Name                     0
Cardholder Last Name            0
Cardholder First Initial        0
Description                     0
Amount                          0
Vendor                          0
Transaction Date                0
Posted Date                     0
Merchant Category Code (MCC)    0
dtype: int64

1.3: Create Data Fields ¶

In [5]:
# Change column names 
df.columns = ['Year_Month', 'Agency_Number', 'Agency_Name', 'Cardholder_Last_Name',
      'Cardholder_First_Initial', 'Description', 'Amount', 'Vendor', 'Transaction_Date',
      'Posted_Date', 'Merchant_Category']
In [6]:
# Creating separate Year and Month columns for future feature engineering
# Converting Transaction_Date and Posted_date from Python Object data type to datetime
df['Transaction_Date']=pd.to_datetime(df['Transaction_Date'])
df['Posted_Date']=pd.to_datetime(df['Posted_Date'])

df['Year_Month'].dtype # Check data type for information purpose
df['Year_Month']=pd.to_datetime(df['Year_Month'], format='%Y%m')
df['Year'] = df['Year_Month'].dt.year
df['Month'] = df['Year_Month'].dt.month

#Creating 'Week_Number' from 'Transaction_Date'
df['Week_Number'] = df['Transaction_Date'].dt.isocalendar().week

#Creating 'Day_of_Week' from 'Transaction_Date'
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['Day_of_Week'] = df['Transaction_Date'].dt.dayofweek.map(dayOfWeek)

df['Day_of_Week'].value_counts()
Out[6]:
Day_of_Week
Wednesday    89644
Thursday     87537
Tuesday      86919
Friday       79917
Monday       61809
Saturday     26307
Sunday       10325
Name: count, dtype: int64
In [7]:
df.head()
Out[7]:
Year_Month Agency_Number Agency_Name Cardholder_Last_Name Cardholder_First_Initial Description Amount Vendor Transaction_Date Posted_Date Merchant_Category Year Month Week_Number Day_of_Week
0 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Mason C GENERAL PURCHASE 890.00 NACAS 2013-07-30 2013-07-31 CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS 2013 7 31 Tuesday
1 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Mason C ROOM CHARGES 368.96 SHERATON HOTEL 2013-07-30 2013-07-31 SHERATON 2013 7 31 Tuesday
2 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Massey J GENERAL PURCHASE 165.82 SEARS.COM 9300 2013-07-29 2013-07-31 DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... 2013 7 31 Monday
3 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Massey T GENERAL PURCHASE 96.39 WAL-MART #0137 2013-07-30 2013-07-31 GROCERY STORES,AND SUPERMARKETS 2013 7 31 Tuesday
4 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Mauro-Herrera M HAMMERMILL COPY PLUS COPY EA 125.96 STAPLES DIRECT 2013-07-30 2013-07-31 STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... 2013 7 31 Tuesday
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Year_Month                442458 non-null  datetime64[ns]
 1   Agency_Number             442458 non-null  int64         
 2   Agency_Name               442458 non-null  object        
 3   Cardholder_Last_Name      442458 non-null  object        
 4   Cardholder_First_Initial  442458 non-null  object        
 5   Description               442458 non-null  object        
 6   Amount                    442458 non-null  float64       
 7   Vendor                    442458 non-null  object        
 8   Transaction_Date          442458 non-null  datetime64[ns]
 9   Posted_Date               442458 non-null  datetime64[ns]
 10  Merchant_Category         442458 non-null  object        
 11  Year                      442458 non-null  int32         
 12  Month                     442458 non-null  int32         
 13  Week_Number               442458 non-null  UInt32        
 14  Day_of_Week               442458 non-null  object        
dtypes: UInt32(1), datetime64[ns](3), float64(1), int32(2), int64(1), object(7)
memory usage: 46.0+ MB

Section 2: EDA ¶

2.1: Distribution of Variables ¶

In [9]:
# Summarize the count statistics by agency_name in df_count
df_count = df['Agency_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Agency_Name','Count']

fig = px.bar(df_count, x='Agency_Name', y='Count', color = 'Agency_Name', width=1000, height=400)

sns.set_theme(style="whitegrid")
sns.barplot(y="Agency_Name", x="Count", data=df_count[1:20])
plt.title('Distribution of Agency', fontweight='bold')
plt.xlabel('Count')
plt.ylabel('Agency Name')
plt.show()
No description has been provided for this image

Distribution of Transaction Amount¶

In [10]:
# Set the style of the seaborn plot
sns.set_style("whitegrid")

# Create a histogram for transaction amounts with a log scale
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=df['Amount'], bins=100, kde=True, log_scale=True)

# Format x-axis tick labels to add comma formatting
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: '${:,.0f}'.format(x) if x >= 1 else '${:.2f}'.format(x)))

plt.title('Distribution of Transaction Amounts', fontweight='bold')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.savefig('Distribution of Transaction Amount.png', transparent=True)

# Display the plot
plt.show()
No description has been provided for this image

Interaction of Transaction Amounts by Agency and Merchant Category¶

In [11]:
# Calculate the total transaction amount for each Agency and Merchant Category
df_grouped = df.groupby(['Agency_Name', 'Merchant_Category'])['Amount'].sum().reset_index()

# Take top 40 agencies for a clearer visualization
df_grouped = df_grouped.sort_values(by='Amount', ascending=False)
df_grouped = df_grouped.head(40)
# Sort DataFrame by Amount
df_grouped = df_grouped.sort_values(by='Amount', ascending=True)

# Create the scatter plot
fig = px.scatter(df_grouped, 
                 x="Agency_Name", 
                 y="Merchant_Category", 
                 color="Amount", 
                 size="Amount", 
                 title='Interaction of Transaction Amounts by Agency and Merchant Category',
                 width=1000, height=600,
                 color_continuous_scale='Viridis_r'
                )

# Customize layout to edit legend title and set background to white
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Amount ($)"  # Legend title
    ),
    title={
        'text': 'Interaction of Transaction Amounts by Agency and Merchant Category',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    paper_bgcolor="rgba(0,0,0,0)",  # Set the background color of the paper (outer) to transparent
    plot_bgcolor='white'  # Set the background color of the plotting area (inner) to white
)

# Customize x-axis
fig.update_xaxes(
    title = 'Agency Name',
    mirror=True,
    ticks='outside',
    showline=True,
    linecolor='lightgrey',
    gridcolor='lightgrey',
    showgrid=False,
    tickangle=-30
)

# Customize y-axis
fig.update_yaxes(
    title = 'Merchant Category',
    mirror=True,
    ticks='outside',
    showline=True,
    linecolor='lightgrey',
    gridcolor='lightgrey',
    showgrid=False
)

fig.write_image("Interactions of two or three variables.png")

# Show the plot
fig.show()

Section 3: Feature Engineering ¶

  1. Avg amount by agency, merchant
  2. Avg amount by agency, merchant in last 30 days
  3. Avg amount by agency, merchant in last 60 days
  4. Avg amount by agency, merchant in last 90 days
  5. Max amount by agency, merchant in last 30 days
  6. Max amount by agency, merchant in last 60 days
  7. Max amount by agency, merchant in last 90 days
  8. Avg transaction by agency, merchant
  9. Avg transaction by agency, merchant in last 30 days
  10. Avg transaction by agency, merchant in last 60 days
  11. Avg transaction by agency, merchant in last 90 days
  12. Avg among by agency, vendor
  13. Avg amount by agency, vendor in last 30 days
  14. Avg amount by agency, vendor in last 60 days
  15. Avg amount by agency, vendor in last 90 days

Features mainly focused on 4 main categories: Average Amount by Agency and Merchant, Maximum Amount by Agency and Merchant, Average Transaction Count by Agency and Merchant, Average Amount by Agency and Vendor. Within each category, we will look at the overall transaction data and across different time period i.e. in last 30, 60, 90 days to identify spending patterns and potential anomalies.

3.1.1 Avg Amount by Agency, Merchant Ratio¶

In [12]:
# Define functions and variables for benchmark calculations
def calculate_avg_amount(data, group_by_cols):
    return data.groupby(group_by_cols)['Amount'].mean().reset_index(name='Avg_Amount')

def calculate_max_amount(data, group_by_cols):
    return data.groupby(group_by_cols)['Amount'].max().reset_index(name='Max_Amount')

def calculate_avg_transaction(data, group_by_cols):
    return data.groupby(group_by_cols).size().reset_index(name='Avg_Transaction_Count')

def calculate_last_n_days(data, n):
    cutoff_date = data['Transaction_Date'].max() - pd.Timedelta(days=n)
    return data[data['Transaction_Date'] > cutoff_date]

# Define the group by columns
agency_merchant_cols = ['Agency_Name', 'Merchant_Category']
agency_vendor_cols = ['Agency_Name', 'Vendor']
In [13]:
table_1 = df.groupby(['Agency_Name', 'Merchant_Category'])['Amount'].mean().reset_index()
table_1 = table_1.rename(columns={'Amount': 'Average_Merchant_amount'})
table_1.head()
Out[13]:
Agency_Name Merchant_Category Average_Merchant_amount
0 ARDMORE HIGHER EDUCATION CENTER BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 115.120000
1 ARDMORE HIGHER EDUCATION CENTER CATALOG MERCHANTS -53.344286
2 ARDMORE HIGHER EDUCATION CENTER COMPUTER NETWORK/INFORMATION SERVICES 18.340000
3 ARDMORE HIGHER EDUCATION CENTER DRUG STORES AND PHARMACIES 153.450000
4 ARDMORE HIGHER EDUCATION CENTER GROCERY STORES,AND SUPERMARKETS 74.525455
In [14]:
def calculate_descriptive_stats(data, column):

    '''
    This function calculate descriptive statistics with additional quantile.
    '''
    
    # Calculate basic descriptive statistics
    desc_stats = data[column].describe()

    # Calculate the 90th and 95th percentiles
    perc_90 = data[column].quantile(0.90)
    perc_95 = data[column].quantile(0.95)

    # Convert the descriptive statistics to a DataFrame
    desc_stats = desc_stats.to_frame().transpose()

    # Add the 90th and 95th percentiles to the DataFrame
    desc_stats['90%'] = perc_90
    desc_stats['95%'] = perc_95

    # Reorder the columns to match the desired format
    desc_stats = desc_stats[['mean', 'min', '25%', '50%', '75%', '90%', '95%', 'max']]

    # Transpose the DataFrame
    desc_stats = desc_stats.T

    return desc_stats
In [15]:
table_1_stats = calculate_descriptive_stats(table_1, 'Average_Merchant_amount')
table_1_stats
Out[15]:
Average_Merchant_amount
mean 526.650020
min -3047.263333
25% 111.995000
50% 269.450000
75% 527.840500
90% 1007.506000
95% 1574.364300
max 171619.610000
In [16]:
table_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8427 entries, 0 to 8426
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Agency_Name              8427 non-null   object 
 1   Merchant_Category        8427 non-null   object 
 2   Average_Merchant_amount  8427 non-null   float64
dtypes: float64(1), object(2)
memory usage: 197.6+ KB
In [17]:
# Merge data back to the dataset 
df1 = pd.merge(df, table_1, how='left', on=['Agency_Name', 'Merchant_Category'])

# Calculate ratio
df1['Average_Merchant_amount_Ratio'] = np.where(df1['Average_Merchant_amount']==0,0, df1['Amount']/df1['Average_Merchant_amount'])
df1['Average_Merchant_amount_Ratio'].describe()
Out[17]:
count    442458.000000
mean          0.999724
std           7.225363
min       -2735.680000
25%           0.177145
50%           0.515301
75%           1.102916
max        2735.680000
Name: Average_Merchant_amount_Ratio, dtype: float64
In [18]:
# Define variables and quantile range
var = 'Average_Merchant_amount_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Merchant Amount Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

3.1.2 Avg Amount by Agency, Merchant Ratio in last 30 days ¶

In [19]:
# Feature 2: Average Amount by Agency, Merchant in last 30 days
table_2 = calculate_avg_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
table_2 = table_2.rename(columns={'Avg_Amount': 'Average_Merchant_amount_30days'})
table_2.head()
Out[19]:
Agency_Name Merchant_Category Average_Merchant_amount_30days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.000000
1 ATTORNEY GENERAL AMERICAN AIRLINES 462.781429
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 436.428571
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... 24.590000
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS 875.000000
In [20]:
table_2.isnull().sum()
Out[20]:
Agency_Name                       0
Merchant_Category                 0
Average_Merchant_amount_30days    0
dtype: int64
In [21]:
table_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3372 entries, 0 to 3371
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Agency_Name                     3372 non-null   object 
 1   Merchant_Category               3372 non-null   object 
 2   Average_Merchant_amount_30days  3372 non-null   float64
dtypes: float64(1), object(2)
memory usage: 79.2+ KB
In [22]:
table_2_stats = calculate_descriptive_stats(table_2, 'Average_Merchant_amount_30days')
table_2_stats
Out[22]:
Average_Merchant_amount_30days
mean 529.500247
min -2393.700000
25% 86.884113
50% 240.048810
75% 556.442353
90% 1089.924333
95% 1886.608333
max 45784.640000
In [23]:
# Add benchmarks back to original df
df1 = pd.merge(df1, table_2, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Average_Merchant_amount_30days'].fillna(0, inplace=True)

# Calculate the ratio
df1['Average_Merchant_amount_30days_Ratio'] = np.where(df1['Average_Merchant_amount_30days']==0,0, df1['Amount']/df1['Average_Merchant_amount_30days'])
df1['Average_Merchant_amount_30days_Ratio'].describe()
Out[23]:
count    442458.000000
mean          2.609376
std         630.043777
min       -6533.448873
25%           0.120986
50%           0.427965
75%           1.037891
max      308210.189408
Name: Average_Merchant_amount_30days_Ratio, dtype: float64
In [24]:
# Define variables and quantile range
var = 'Average_Merchant_amount_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Merchant Amount in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.3 Avg Amount by Agency, Merchant Ratio in last 60 days ¶

In [25]:
# Feature 3: Average Amount by Agency, Merchant in last 60 days
table_3 = calculate_avg_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
table_3 = table_3.rename(columns={'Avg_Amount': 'Average_Merchant_amount_60days'})
table_3.head()
Out[25]:
Agency_Name Merchant_Category Average_Merchant_amount_60days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.000000
1 ATTORNEY GENERAL AMERICAN AIRLINES 465.248125
2 ATTORNEY GENERAL BOOK STORES 123.225000
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 332.034545
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... 118.990000
In [26]:
table_3_stats = calculate_descriptive_stats(table_3, 'Average_Merchant_amount_60days')
table_3_stats
Out[26]:
Average_Merchant_amount_60days
mean 542.288064
min -995.600000
25% 96.037500
50% 265.452500
75% 551.635076
90% 1079.433628
95% 1742.393225
max 134734.050000
In [27]:
# Add benchmarks back to original df
df1 = pd.merge(df1, table_3, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Average_Merchant_amount_60days'].fillna(0, inplace=True)

# Calculate the ratio
df1['Average_Merchant_amount_60days_Ratio'] = np.where(df1['Average_Merchant_amount_60days']==0,0, df1['Amount']/df1['Average_Merchant_amount_60days'] )
df1['Average_Merchant_amount_60days_Ratio'].describe()
Out[27]:
count    4.424580e+05
mean     7.628855e+11
std      4.358540e+15
min     -8.104228e+17
25%      1.399177e-01
50%      4.544632e-01
75%      1.046130e+00
max      8.104228e+17
Name: Average_Merchant_amount_60days_Ratio, dtype: float64
In [28]:
# Define variables and quantile range
var = 'Average_Merchant_amount_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Merchant Amount in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.4 Avg Amount by Agency, Merchant Ratio in last 90 days ¶

In [29]:
# Feature 4: Average Amount by Agency, Merchant in last 90 days
table_4 = calculate_avg_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
table_4 = table_4.rename(columns={'Avg_Amount': 'Average_Merchant_amount_90days'})
table_4.head()
Out[29]:
Agency_Name Merchant_Category Average_Merchant_amount_90days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.000000
1 ATTORNEY GENERAL AMERICAN AIRLINES 417.765667
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 97.215000
3 ATTORNEY GENERAL BOOK STORES 123.225000
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 299.336923
In [30]:
table_4_stats = calculate_descriptive_stats(table_4, 'Average_Merchant_amount_90days')
table_4_stats
Out[30]:
Average_Merchant_amount_90days
mean 538.989403
min -842.720000
25% 102.350694
50% 270.000000
75% 555.795938
90% 1061.350727
95% 1641.144750
max 214206.120000
In [31]:
# Merge dfs
df1 = pd.merge(df1, table_4, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Average_Merchant_amount_90days'].fillna(0, inplace=True)

# Calculate the ratio
df1['Average_Merchant_amount_90days_Ratio'] = np.where(df1['Average_Merchant_amount_90days']==0,0, df1['Amount']/df1['Average_Merchant_amount_90days'])
df1['Average_Merchant_amount_90days_Ratio'].describe()
Out[31]:
count    4.424580e+05
mean     7.628855e+11
std      4.358540e+15
min     -8.104228e+17
25%      1.480874e-01
50%      4.679337e-01
75%      1.067796e+00
max      8.104228e+17
Name: Average_Merchant_amount_90days_Ratio, dtype: float64
In [32]:
# Define variables and quantile range
var = 'Average_Merchant_amount_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Merchant Amount in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.5 Max Amount by Agency, Merchant Ratio in last 30 days ¶

In [33]:
# Feature 5: Max Amount by Agency, Merchant in last 30 days
table_5 = calculate_max_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
table_5 = table_5.rename(columns={'Max_Amount': 'Max_Merchant_amount_30days'})
table_5.head()
Out[33]:
Agency_Name Merchant_Category Max_Merchant_amount_30days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.00
1 ATTORNEY GENERAL AMERICAN AIRLINES 982.00
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1000.00
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... 24.59
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS 1325.00
In [34]:
table_5_stats = calculate_descriptive_stats(table_5, 'Max_Merchant_amount_30days')
table_5_stats
Out[34]:
Max_Merchant_amount_30days
mean 1552.831412
min -2393.700000
25% 118.747500
50% 425.155000
75% 1188.075000
90% 3002.620000
95% 4783.725000
max 132790.140000
In [35]:
# Merge dataset 
df1 = pd.merge(df1, table_5, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Max_Merchant_amount_30days'].fillna(0, inplace=True)

# Calculate the ratio
df1['Max_Merchant_amount_30days_Ratio'] = np.where(df1['Max_Merchant_amount_30days']==0,0, df1['Amount']/df1['Max_Merchant_amount_30days'])
df1['Max_Merchant_amount_30days_Ratio'].describe()
Out[35]:
count    442458.000000
mean          0.694637
std         145.885707
min       -6533.448873
25%           0.010224
50%           0.047619
75%           0.193889
max       70907.201862
Name: Max_Merchant_amount_30days_Ratio, dtype: float64
In [36]:
# Define variables and quantile range
var = 'Max_Merchant_amount_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Max Merchant Amount in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.6 Max Amount by Agency, Merchant Ratio in last 60 days ¶

In [37]:
# Feature 6: Max Amount by Agency, Merchant in last 60 days
table_6 = calculate_max_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
table_6 = table_6.rename(columns={'Max_Amount': 'Max_Merchant_amount_60days'})
table_6.head()
Out[37]:
Agency_Name Merchant_Category Max_Merchant_amount_60days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.00
1 ATTORNEY GENERAL AMERICAN AIRLINES 982.00
2 ATTORNEY GENERAL BOOK STORES 208.50
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1000.00
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... 306.66
In [38]:
table_6_stats = calculate_descriptive_stats(table_6, 'Max_Merchant_amount_60days')
table_6_stats
Out[38]:
Max_Merchant_amount_60days
mean 1794.498448
min -995.600000
25% 154.360000
50% 488.925000
75% 1340.890000
90% 3515.500000
95% 4919.054500
max 348053.750000
In [39]:
# Merge dataset
df1 = pd.merge(df1, table_6, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Max_Merchant_amount_60days'].fillna(0, inplace=True)
# Calculate the ratio
df1['Max_Merchant_amount_60days_Ratio'] = np.where(df1['Max_Merchant_amount_60days']==0,0, df1['Amount']/df1['Max_Merchant_amount_60days'])
df1['Max_Merchant_amount_60days_Ratio'].describe()
Out[39]:
count    442458.000000
mean          0.610762
std         145.128071
min        -853.071672
25%           0.009709
50%           0.040093
75%           0.157241
max       70907.201862
Name: Max_Merchant_amount_60days_Ratio, dtype: float64
In [40]:
# Define variables and quantile range
var = 'Max_Merchant_amount_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Max Merchant Amount in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.7 Max Amount by Agency, Merchant Ratio in last 90 days ¶

In [41]:
# Feature 7: Max Amount by Agency, Merchant in last 90 days
table_7 = calculate_max_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
table_7 = table_7.rename(columns={'Max_Amount': 'Max_Merchant_amount_90days'})
table_7.head()
Out[41]:
Agency_Name Merchant_Category Max_Merchant_amount_90days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.00
1 ATTORNEY GENERAL AMERICAN AIRLINES 982.00
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 186.17
3 ATTORNEY GENERAL BOOK STORES 208.50
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1000.00
In [42]:
table_7_stats = calculate_descriptive_stats(table_7, 'Max_Merchant_amount_90days')
table_7_stats
Out[42]:
Max_Merchant_amount_90days
mean 1906.054049
min -842.720000
25% 166.000000
50% 513.520000
75% 1448.627500
90% 3634.704000
95% 4982.698500
max 373150.260000
In [43]:
# Merge dataset
df1 = pd.merge(df1, table_7, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Max_Merchant_amount_90days'].fillna(0, inplace=True)

# Calculate the ratio
df1['Max_Merchant_amount_90days_Ratio'] = np.where(df1['Max_Merchant_amount_90days']==0,0, df1['Amount']/df1['Max_Merchant_amount_90days'])
df1['Max_Merchant_amount_90days_Ratio'].describe()
Out[43]:
count    442458.000000
mean          0.533116
std         144.864992
min        -464.136871
25%           0.009191
50%           0.035351
75%           0.138708
max       70907.201862
Name: Max_Merchant_amount_90days_Ratio, dtype: float64
In [44]:
# Define variables and quantile range
var = 'Max_Merchant_amount_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Max Merchant Amount in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.8 Avg Transaction by Agency, Merchant Ratio ¶

In [45]:
# Feature 8: Average Transaction by Agency, Merchant
table_8 = calculate_avg_transaction(df, agency_merchant_cols)
table_8 = table_8.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction'})
table_8.head()
Out[45]:
Agency_Name Merchant_Category Avg_Transaction
0 ARDMORE HIGHER EDUCATION CENTER BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1
1 ARDMORE HIGHER EDUCATION CENTER CATALOG MERCHANTS 7
2 ARDMORE HIGHER EDUCATION CENTER COMPUTER NETWORK/INFORMATION SERVICES 1
3 ARDMORE HIGHER EDUCATION CENTER DRUG STORES AND PHARMACIES 1
4 ARDMORE HIGHER EDUCATION CENTER GROCERY STORES,AND SUPERMARKETS 11
In [46]:
table_8_stats = calculate_descriptive_stats(table_8, 'Avg_Transaction')
table_8_stats
Out[46]:
Avg_Transaction
mean 52.504806
min 1.000000
25% 2.000000
50% 5.000000
75% 17.000000
90% 67.000000
95% 168.700000
max 8102.000000
In [47]:
baseline_transaction=table_8['Avg_Transaction'].mean()
In [48]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_8, how='left', on=['Agency_Name', 'Merchant_Category'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Avg_Transaction_Ratio'] = np.where(df1['Avg_Transaction']==0,0, baseline_transaction/df1['Avg_Transaction'])
df1['Avg_Transaction_Ratio'].describe()
Out[48]:
count    442458.000000
mean          1.000000
std           4.235557
min           0.006480
25%           0.020960
50%           0.070666
75%           0.352381
max          52.504806
Name: Avg_Transaction_Ratio, dtype: float64
In [49]:
# Define variables and quantile range
var = 'Avg_Transaction_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.9 Avg Transaction by Agency, Merchant Ratio in last 30 days ¶

In [50]:
table_9=calculate_avg_transaction(calculate_last_n_days(df, 30), agency_merchant_cols)
table_9 = table_9.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction_30days'})
table_9.head()
Out[50]:
Agency_Name Merchant_Category Avg_Transaction_30days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1
1 ATTORNEY GENERAL AMERICAN AIRLINES 7
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 7
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... 1
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS 3
In [51]:
table_9_stats = calculate_descriptive_stats(table_9, 'Avg_Transaction_30days')
table_9_stats
Out[51]:
Avg_Transaction_30days
mean 10.469158
min 1.000000
25% 1.000000
50% 2.000000
75% 6.000000
90% 19.000000
95% 38.000000
max 643.000000
In [52]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_9, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Avg_Transaction_30days'].fillna(0, inplace=True)

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Avg_Transaction_30days_Ratio'] = np.where(df1['Avg_Transaction_30days']==0,0, baseline_transaction/df1['Avg_Transaction_30days'])
df1['Avg_Transaction_30days_Ratio'].describe()
Out[52]:
count    442458.000000
mean          4.430460
std          10.344312
min           0.000000
25%           0.195913
50%           0.625057
75%           2.763411
max          52.504806
Name: Avg_Transaction_30days_Ratio, dtype: float64
In [53]:
# Define variables and quantile range
var = 'Avg_Transaction_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.10 Avg Transaction by Agency, Merchant Ratio in last 60 days ¶

In [54]:
table_10=calculate_avg_transaction(calculate_last_n_days(df, 60), agency_merchant_cols)
table_10 = table_10.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction_60days'})
table_10.head()
Out[54]:
Agency_Name Merchant_Category Avg_Transaction_60days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1
1 ATTORNEY GENERAL AMERICAN AIRLINES 16
2 ATTORNEY GENERAL BOOK STORES 2
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 11
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... 3
In [55]:
table_10_stats = calculate_descriptive_stats(table_10, 'Avg_Transaction_60days')
table_10_stats
Out[55]:
Avg_Transaction_60days
mean 15.985294
min 1.000000
25% 1.000000
50% 3.000000
75% 8.000000
90% 26.100000
95% 56.000000
max 1302.000000
In [56]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_10, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Avg_Transaction_60days'].fillna(0, inplace=True)

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Avg_Transaction_60days_Ratio'] = np.where(df1['Avg_Transaction_60days']==0,0, baseline_transaction/ df1['Avg_Transaction_60days'])
df1['Avg_Transaction_60days_Ratio'].describe()
Out[56]:
count    442458.000000
mean          3.022885
std           8.245084
min           0.000000
25%           0.109613
50%           0.362102
75%           1.640775
max          52.504806
Name: Avg_Transaction_60days_Ratio, dtype: float64
In [57]:
# Define variables and quantile range
var = 'Avg_Transaction_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.11 Avg Transaction by Agency, Merchant Ratio in last 90 days ¶

In [58]:
table_11=calculate_avg_transaction(calculate_last_n_days(df, 90), agency_merchant_cols)
table_11 = table_11.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction_90days'})
table_11.head()
Out[58]:
Agency_Name Merchant_Category Avg_Transaction_90days
0 ATTORNEY GENERAL ADVERTISING SERVICES 1
1 ATTORNEY GENERAL AMERICAN AIRLINES 30
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 2
3 ATTORNEY GENERAL BOOK STORES 2
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 13
In [59]:
table_11_stats = calculate_descriptive_stats(table_11, 'Avg_Transaction_90days')
table_11_stats
Out[59]:
Avg_Transaction_90days
mean 21.052301
min 1.000000
25% 1.000000
50% 3.000000
75% 10.000000
90% 33.000000
95% 74.150000
max 2146.000000
In [60]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_11, how='left', on=['Agency_Name', 'Merchant_Category'])

df1['Avg_Transaction_90days'].fillna(0, inplace=True)

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Avg_Transaction_90days_Ratio'] = np.where(df1['Avg_Transaction_90days']==0,0, baseline_transaction/df1['Avg_Transaction_90days'])
df1['Avg_Transaction_90days_Ratio'].describe()
Out[60]:
count    442458.000000
mean          2.336722
std           7.166343
min           0.000000
25%           0.073536
50%           0.245350
75%           1.166773
max          52.504806
Name: Avg_Transaction_90days_Ratio, dtype: float64
In [61]:
# Define variables and quantile range
var = 'Avg_Transaction_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.12 Avg Amount by Agency, Vendor Ratio ¶

In [62]:
table_12 = calculate_avg_amount(df, agency_vendor_cols)
table_12 = table_12.rename(columns={'Avg_Amount': 'Average_Vendor_amount'})
table_12.head()
Out[62]:
Agency_Name Vendor Average_Vendor_amount
0 ARDMORE HIGHER EDUCATION CENTER AGRI PRODUCTS 92.500000
1 ARDMORE HIGHER EDUCATION CENTER CDW GOVERNMENT -53.344286
2 ARDMORE HIGHER EDUCATION CENTER COUNTY BUILDING CE 58.440000
3 ARDMORE HIGHER EDUCATION CENTER EMPIRE PAPER COMPANY 443.540000
4 ARDMORE HIGHER EDUCATION CENTER GODADDY.COM 18.340000
In [63]:
table_12_stats = calculate_descriptive_stats(table_12, 'Average_Vendor_amount')
table_12_stats
Out[63]:
Average_Vendor_amount
mean 4.471355e+02
min -4.907220e+03
25% 2.900000e+01
50% 1.561667e+02
75% 4.560000e+02
90% 8.486420e+02
95% 1.432209e+03
max 1.827119e+06
In [64]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_12, how='left', on=['Agency_Name', 'Vendor'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Average_Vendor_amount_Ratio'] = np.where(df1['Average_Vendor_amount']==0,0, df1['Amount']/df1['Average_Vendor_amount'])
df1['Average_Vendor_amount_Ratio'].describe()
Out[64]:
count    4.424580e+05
mean     9.931041e-01
std      1.963144e+14
min     -6.944551e+16
25%      3.399096e-01
50%      9.543105e-01
75%      1.040700e+00
max      6.424385e+16
Name: Average_Vendor_amount_Ratio, dtype: float64
In [65]:
# Define variables and quantile range
var = 'Average_Vendor_amount_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Vendor Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.13 Avg Amount by Agency, Vendor Ratio in last 30 days ¶

In [66]:
table_13=calculate_avg_amount(calculate_last_n_days(df, 30), agency_vendor_cols)
table_13 = table_13.rename(columns={'Avg_Amount': 'Average_Vendor_amount_30days'})
table_13.head()
Out[66]:
Agency_Name Vendor Average_Vendor_amount_30days
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.20
1 ATTORNEY GENERAL AMERICAN AI 0017450939750 -250.03
2 ATTORNEY GENERAL AMERICAN AI 0017456015142 982.00
3 ATTORNEY GENERAL AMERICAN AI 0017459356062 669.00
4 ATTORNEY GENERAL AMERICAN AI 0017459356079 687.50
In [67]:
table_13_stats = calculate_descriptive_stats(table_13, 'Average_Vendor_amount_30days')
table_13_stats
Out[67]:
Average_Vendor_amount_30days
mean 507.333145
min -4987.040000
25% 40.680000
50% 162.802778
75% 474.000000
90% 1031.480000
95% 1822.994000
max 60347.380000
In [68]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_13, how='left', on=['Agency_Name', 'Vendor'])

df1['Average_Vendor_amount_30days'].fillna(0, inplace=True)

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Average_Vendor_amount_30days_Ratio'] = np.where(df1['Average_Vendor_amount_30days']==0,0, df1['Amount']/ df1['Average_Vendor_amount_30days'])
df1['Average_Vendor_amount_30days_Ratio'].describe()
Out[68]:
count    4.424580e+05
mean     8.161432e-01
std      1.424394e+14
min     -6.944551e+16
25%      0.000000e+00
50%      1.390245e-01
75%      8.809942e-01
max      6.424385e+16
Name: Average_Vendor_amount_30days_Ratio, dtype: float64
In [69]:
# Define variables and quantile range
var = 'Average_Vendor_amount_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Vendor in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.14 Avg Amount by Agency, Vendor Ratio in last 60 days ¶

In [70]:
table_14 =calculate_avg_amount(calculate_last_n_days(df, 60), agency_vendor_cols)
table_14 = table_14.rename(columns={'Avg_Amount': 'Average_Vendor_amount_60days'})
table_14.head()
Out[70]:
Agency_Name Vendor Average_Vendor_amount_60days
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.200
1 ATTORNEY GENERAL AMAZON MKTPLACE PMTS 123.225
2 ATTORNEY GENERAL AMERICAN AI 0017450939750 205.985
3 ATTORNEY GENERAL AMERICAN AI 0017450939785 -361.000
4 ATTORNEY GENERAL AMERICAN AI 0017450939901 250.000
In [71]:
table_14_stats = calculate_descriptive_stats(table_14, 'Average_Vendor_amount_60days')
table_14_stats
Out[71]:
Average_Vendor_amount_60days
mean 492.806725
min -4999.000000
25% 37.000000
50% 162.000000
75% 475.000000
90% 999.735000
95% 1734.096500
max 348053.750000
In [72]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_14, how='left', on=['Agency_Name', 'Vendor'])

df1['Average_Vendor_amount_60days'].fillna(0, inplace=True)

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Average_Vendor_amount_60days_Ratio'] = np.where(df1['Average_Vendor_amount_60days']==0,0, df1['Amount']/df1['Average_Vendor_amount_60days'])
df1['Average_Vendor_amount_60days_Ratio'].describe()
Out[72]:
count    4.424580e+05
mean     8.131557e-01
std      1.424394e+14
min     -6.944551e+16
25%      0.000000e+00
50%      2.655243e-01
75%      1.000000e+00
max      6.424385e+16
Name: Average_Vendor_amount_60days_Ratio, dtype: float64
In [73]:
# Define variables and quantile range
var = 'Average_Vendor_amount_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Vendor in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

3.1.15 Avg Amount by Agency, Vendor Ratio in last 90 days ¶

In [74]:
table_15 =calculate_avg_amount(calculate_last_n_days(df, 90), agency_vendor_cols)
table_15 = table_15.rename(columns={'Avg_Amount': 'Average_Vendor_amount_90days'})
table_15.head()
Out[74]:
Agency_Name Vendor Average_Vendor_amount_90days
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.200
1 ATTORNEY GENERAL AMAZON MKTPLACE PMTS 123.225
2 ATTORNEY GENERAL AMERICAN AI 0017394725288 444.000
3 ATTORNEY GENERAL AMERICAN AI 0017395886798 673.000
4 ATTORNEY GENERAL AMERICAN AI 0017395886825 0.000
In [75]:
table_15_stats = calculate_descriptive_stats(table_15, 'Average_Vendor_amount_90days')
table_15_stats
Out[75]:
Average_Vendor_amount_90days
mean 467.539420
min -3889.570000
25% 36.185000
50% 164.430000
75% 478.858333
90% 975.000000
95% 1669.614000
max 298416.860000
In [76]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(df1, table_15, how='left', on=['Agency_Name', 'Vendor'])

df1['Average_Vendor_amount_90days'].fillna(0, inplace=True)

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df1['Average_Vendor_amount_90days_Ratio'] = np.where(df1['Average_Vendor_amount_90days']==0,0, df1['Amount']/df1['Average_Vendor_amount_90days'])
df1['Average_Vendor_amount_90days_Ratio'].describe()
Out[76]:
count    4.424580e+05
mean     8.885308e-01
std      1.424394e+14
min     -6.944551e+16
25%      0.000000e+00
50%      3.731346e-01
75%      1.000000e+00
max      6.424385e+16
Name: Average_Vendor_amount_90days_Ratio, dtype: float64
In [77]:
# Define variables and quantile range
var = 'Average_Vendor_amount_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(title='Distribution of Average Vendor in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)

# Show the plot
fig.show()

Data Preparation for Modeling¶

In [79]:
# Extracting only the ratio columns
ratio_columns = [col for col in df1.columns if col.endswith('Ratio')]
X = df1[ratio_columns]
X.head()
Out[79]:
Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio
0 2.081219 1.741068 1.929814 2.279633 0.404545 0.263002 0.190098 0.050388 0.719244 0.354762 0.190927 1.243789 0.000000 0.000000 0.000000
1 0.974461 0.564751 0.715171 0.562686 0.208580 0.208580 0.057642 0.290082 4.375400 2.019416 1.280605 0.781845 0.000000 1.257404 0.281362
2 0.422515 0.378935 0.394945 0.404276 0.033164 0.033164 0.033164 0.024455 0.312529 0.157200 0.104383 1.801787 0.000000 -6.842170 1.974195
3 1.611007 1.704157 1.614249 1.554416 0.117421 0.117421 0.115509 0.006480 0.081656 0.040326 0.024466 1.544294 1.578127 1.714738 1.609596
4 0.464432 0.330686 0.422823 0.436213 0.026625 0.026625 0.026625 0.017822 0.256121 0.125011 0.078016 0.876658 0.663009 0.747109 0.876767
In [80]:
# Splitting data into train and test sets for X (features only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=123)
In [81]:
X_train_norm, X_test_norm = standardizer(X_train, X_test)

Section 4: Histogram-Based Outlier Score (HBOS)¶

Approach:

  • HBOS uses histograms to model the distribution of each feature independently. It divides the range of each feature into bins and computes the density of data points in each bin.

Scoring:

  • The anomaly score is based on the inverse of the density; regions with lower density (fewer data points) are considered more anomalous.

Advantages:

  • Simple and fast, particularly effective for univariate data or when features are independent.

Limitations:

  • May not capture interactions between features effectively. Sensitive to bin width and outliers.
In [137]:
# Define functions
def count_stat(vector):
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

def descriptive_stat_threshold(df, pred_scores, threshold):
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_scores
    df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')

    # Calculate count and percentage
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100

    # Calculate mean statistics
    stat = df.groupby('Group').mean().round(2).reset_index()

    # Merge count and mean statistics
    stat = cnt.merge(stat, left_on='Group',right_on='Group')
    return stat

# Initialize parameters for hyperparameter tuning
param_grid = {'n_bins': [300, 600, 700], 'contamination': [0.01, 0.02, 0.03, 0.04, 0.05], 'alpha': [0.1, 0.2, 0.5, 1.0]}

4.1 HBOS¶

In [138]:
# Initialize HBOS model
hbos = HBOS()

# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(estimator=hbos, param_grid=param_grid, scoring='roc_auc')
grid_search.fit(X_train_norm)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Use the best hyperparameters to train the final HBOS model
best_hbos = HBOS(**best_params)
best_hbos.fit(X_train_norm)

# Get the anomaly scores and predictions
y_train_scores = best_hbos.decision_function(X_train_norm)
y_train_pred = best_hbos.predict(X_train_norm)
y_test_scores = best_hbos.decision_function(X_test_norm)
y_test_pred = best_hbos.predict(X_test_norm)

# Print threshold and count statistics
print(f"Best hyperparameters for HBOS: {best_params}")
threshold = best_hbos.threshold_
print(f"The threshold for HBOS: {threshold}")
print(f"The training data (counts) for HBOS: {count_stat(y_train_pred)}")
print(f"The test data (counts) for HBOS: {count_stat(y_test_pred)}", '\n')
Best hyperparameters for HBOS: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 300}
The threshold for HBOS: 16.35766384715818
The training data (counts) for HBOS: {0: 350458, 1: 3508}
The test data (counts) for HBOS: {0: 87620, 1: 872} 

In [139]:
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
    fig = go.Figure()

    # Split the scores into two parts: below the threshold and above/equal to the threshold
    below_threshold = scores[scores < threshold]
    above_threshold = scores[scores >= threshold]

    # Add histogram traces for the two parts
    fig.add_trace(go.Histogram(x=below_threshold, name='Normal', marker=dict(color='#2b7574')))
    fig.add_trace(go.Histogram(x=above_threshold, name='Outliers', marker=dict(color='#c83a21')))
    
    # Add vertical line annotating the threshold
    fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
                  annotation_position="top right", annotation=dict(font=dict(color="#27496d", size=12)))

    # Update layout
    fig.update_layout(
        title=title,
        xaxis_title="Anomaly Score",
        yaxis_title="Frequency",
        showlegend=True,
        height=450,
        width=800,
        template='plotly_white',
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )
    
    # Show the plot
    fig.show()

# Plotting the combination by average for test data
plot_histogram_with_threshold(np.array(y_test_scores), "Anomaly Scores for HBOS Test Data", threshold)
In [140]:
# Get the descriptive statistics for outliers
stats_train_hbos = descriptive_stat_threshold(X_train, y_train_scores, threshold)
stats_test_hbos = descriptive_stat_threshold(X_test, y_test_scores, threshold)
In [141]:
stats_train_hbos
Out[141]:
Group Count Count % Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio Anomaly_Score
0 Normal 350383 98.987756 1.01 1.13 2.293682e+12 2.293682e+12 0.31 0.24 0.21 0.96 4.21 2.87 2.21 2.446502e+11 0.81 0.81 0.88 -0.01
1 Outlier 3583 1.012244 1.09 105.68 1.335400e+02 1.066500e+02 30.40 27.17 21.42 4.53 26.23 19.23 14.96 -1.510915e+13 1.29 1.30 1.50 17.31
In [142]:
stats_test_hbos
Out[142]:
Group Count Count % Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio Anomaly_Score
0 Normal 87603 98.995389 0.95 1.09 -5.320852e+12 -5.320852e+12 0.28 0.23 0.2 0.97 4.20 2.82 2.21 -3.605491e+11 0.82 0.81 0.90 0.01
1 Outlier 889 1.004611 1.29 319.57 4.121900e+02 3.882400e+02 73.35 75.81 74.9 4.60 26.54 19.56 15.06 1.000000e+00 1.11 1.01 0.92 17.29

Key Takeaways

A total of 4,460 outliers were captured by the HBOS model which account for 1% of the train and test dataset respectively. Anomaly scores saw a large difference between normal and outlier groups.

4.2 Combination by Average ¶

In [208]:
# Test a range of binning 
k_list = [50, 100, 150, 200, 300, 350, 400, 600, 700, 800]
n_clf = len(k_list)
n_bins= 300
# Just prepare data frames so we can store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])

# Modeling
for i in range(n_clf):
    k = k_list[i]
    hbos_avg = HBOS(n_bins=n_bins)
    hbos_avg.fit(X_train_norm)
    # Store the results in each column:
    train_scores[:, i] = hbos_avg.decision_function(X_train_norm) 
    test_scores[:, i] = hbos_avg.decision_function(X_test_norm) 
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
In [209]:
# Combination by average
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
In [201]:
plt.hist(y_train_by_average, bins='auto') # arguments are passed to np.histogram
plt.title("Combination by average")
plt.show()
No description has been provided for this image

Based on the above histogram, I'll pick 2.1 as the threshold to identify the outlier.

In [235]:
plot_histogram_with_threshold(np.array(y_test_by_average), "Anomaly Scores for HBOS Combination by Average", 2.1)
In [215]:
# Get the descriptive statistics for outliers
stats_train_hbos_avg = descriptive_stat_threshold(X_train, y_train_by_average, 2.1)
stats_test_hbos_avg = descriptive_stat_threshold(X_test, y_test_by_average, 2.1)
In [216]:
stats_train_hbos_avg
Out[216]:
Group Count Count % Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio Anomaly_Score
0 Normal 350105 98.909217 1.01 1.13 2.295504e+12 2.295504e+12 0.31 0.24 0.21 0.96 4.19 2.86 2.20 2.448445e+11 0.81 0.81 0.88 -0.03
1 Outlier 3861 1.090783 1.10 98.09 1.242700e+02 9.925000e+01 28.00 25.30 19.94 4.34 26.08 18.78 14.62 -1.402126e+13 1.33 1.32 1.50 2.27
In [217]:
stats_test_hbos_avg
Out[217]:
Group Count Count % Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio Anomaly_Score
0 Normal 87533 98.916286 0.95 1.13 -5.325107e+12 -5.325107e+12 0.32 0.23 0.20 0.97 4.18 2.81 2.20 -3.608375e+11 0.82 0.81 0.90 -0.02
1 Outlier 959 1.083714 1.27 292.78 3.825800e+02 3.603300e+02 64.11 70.39 69.49 4.39 26.43 19.12 14.71 9.500000e-01 1.35 1.24 1.15 2.27

Key Takeaways

A total of 4,820 outliers were captured by the HBOS model which account for 1% of the train and test dataset respectively. Anomaly scores saw a large difference between normal and outlier groups.

Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD) ¶

Approach:

  • ECOD uses the empirical cumulative distribution function (ECDF) to model the data. It considers the rank of each data point in the cumulative distribution.

Scoring:

  • The anomaly score is based on the empirical distribution of the data. Points far from the cumulative distribution curve are considered more anomalous.

Advantages:

  • Non-parametric and can handle non-linear relationships and interactions between features.

Limitations:

  • Can be computationally intensive for large datasets. Sensitive to the empirical distribution.
In [243]:
ecod = ECOD(contamination=0.01)
ecod.fit(X_train_norm)

# Training data
y_train_scores_ecod = ecod.decision_function(X_train_norm)
y_train_pred_ecod = ecod.predict(X_train_norm)

# Test data
y_test_scores_ecod = ecod.decision_function(X_test_norm)
y_test_pred_ecod = ecod.predict(X_test_norm)

# Print threshold and count statistics
print(f"Best hyperparameters for ECOD: {best_params}")
threshold_ecod = ecod.threshold_
print(f"The threshold for ECOD: {threshold_ecod}")
print(f"The training data (counts) for ECOD: {count_stat(y_train_pred_ecod)}")
print(f"The test data (counts) for ECOD: {count_stat(y_test_pred_ecod)}", '\n')
Best hyperparameters for ECOD: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 300}
The threshold for ECOD: 56.87318616811985
The training data (counts) for ECOD: {0: 350426, 1: 3540}
The test data (counts) for ECOD: {0: 87581, 1: 911} 

In [242]:
plot_histogram_with_threshold(np.array(y_test_scores_ecod), "Anomaly Scores for ECOD", threshold_ecod)
In [244]:
# Get the descriptive statistics for outliers
stats_train_ecod_avg = descriptive_stat_threshold(X_train, y_train_scores_ecod, threshold_ecod)
stats_test_ecod_avg = descriptive_stat_threshold(X_test, y_test_scores_ecod, threshold_ecod)
In [245]:
stats_train_ecod_avg
Out[245]:
Group Count Count % Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio Anomaly_Score
0 Normal 350426 98.999904 0.98 1.09 1.122860e+13 1.122860e+13 0.31 0.22 0.19 0.99 4.35 2.98 2.30 4.141689e+08 0.75 0.77 0.84 23.60
1 Outlier 3540 1.000096 3.92 111.12 -8.844993e+14 -8.844993e+14 31.15 29.61 24.03 1.41 12.36 8.59 5.95 8.881370e+12 6.83 4.86 5.37 64.19
In [246]:
stats_test_ecod_avg
Out[246]:
Group Count Count % Average_Merchant_amount_Ratio Average_Merchant_amount_30days_Ratio Average_Merchant_amount_60days_Ratio Average_Merchant_amount_90days_Ratio Max_Merchant_amount_30days_Ratio Max_Merchant_amount_60days_Ratio Max_Merchant_amount_90days_Ratio Avg_Transaction_Ratio Avg_Transaction_30days_Ratio Avg_Transaction_60days_Ratio Avg_Transaction_90days_Ratio Average_Vendor_amount_Ratio Average_Vendor_amount_30days_Ratio Average_Vendor_amount_60days_Ratio Average_Vendor_amount_90days_Ratio Anomaly_Score
0 Normal 87581 98.970528 0.93 1.10 6.754294e+12 6.754294e+12 0.33 0.21 0.18 1.00 4.35 2.93 2.31 9.500000e-01 0.77 0.77 0.85 23.60
1 Outlier 911 1.029472 3.43 311.04 -1.160999e+15 -1.160999e+15 66.29 76.39 74.97 1.27 11.45 8.29 5.29 -3.467090e+13 6.00 5.21 5.68 64.32

Key Takeaways

A total of 4,451 outliers were captured by the ECOD model which account for 1% of the train and test dataset respectively. Anomaly scores saw a large difference between normal and outlier groups.

Section 6: Models Predictions Comparison¶

In [247]:
# Compare HBOS and ECOD predictions
comparison_df = pd.DataFrame({'HBOS_pred': y_test_pred, 'ECOD_pred': y_test_pred_ecod})

print(pd.crosstab(comparison_df['HBOS_pred'], comparison_df['ECOD_pred']))
ECOD_pred      0    1
HBOS_pred            
0          84085    0
1           3496  911
In [270]:
# Compute confusion matrix
cm = pd.crosstab(y_test_pred, y_test_pred_ecod)

# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm,
                    x=['Predicted 0', 'Predicted 1'],
                    y=['True 0', 'True 1'],
                    showscale=False,
                    colorscale=[
                        [0.0, "#c83a21"],
                        [0.02, "#2b7574"],
                        [1.0, "#ffffff"]   
                    ],
                    xgap=20,
                    ygap=20,
                    text=cm,
                    texttemplate="%{text}")

# Define layout
layout = go.Layout(
    title=dict(text="Confusion Matrix between HBOS and ECOD", x=0.5, y=0.9, xanchor='center', yanchor='top'),
    xaxis=dict(title='HBOS Pred', showticklabels=True),
    yaxis=dict(title='ECOD Pred', showticklabels=True),
    autosize=False,
    width=500,
    height=500,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'
)

# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()

Key Takeways from Crosstab:

Agree on Normal:

  • Both HBOS and ECOD predicted 84,085 instances as normal (0)

Agree on Outliers:

  • Both HBOS and ECOD predicted 911 instances as outliers (1)

Discrepancy in Predictions:

  • There are 0 instances where HBOS predicted normal (0), but ECOD predicted outliers (1)
  • There are 3,496 instances where HBOS predicted outliers (1), but ECOD predicted normal (0). This indicates that HBOS is more sensitive, potentially identifying more outliers but including some that ECOD considers normal.

The high count (84,085) of instances predicted as normal by both models indicates a strong agreement on what constitutes normal behavior.

Both models agreed on 911 instances being outliers. This agreement suggests that these instances are likely true outliers, as both models identified them as such despite their different methodologies.

ECOD is stricter with a higher threshold, leading to fewer false positives and potentially misses some outliers that HBOS catches (higher false negatives). On the other hand, HBOS is more sensitive with a lower threshold, leading to more outliers detected, including some instances that ECOD considers normal (higher false postives)

Section 7: Conclusion¶

Detection Method Comparison:

HBOS tends to identify fewer outliers with scores more tightly clustered around the center. ECOD detects a broader range of outliers, showing more granularity in the outlier scores.

Outliers Locations:

Outliers are generally detected with higher scores (above 2 for HBOS and above 5 for ECOD). These outliers are found at the tails of the distribution in both training and test datasets.

In the context of credit card transaction anomaly detection, both models have their strengths and weaknesses. ECOD is more aggressive in flagging anomalies, which might be better for catching more fraudulent transactions at the cost of higher false positives. HBOS, being more conservative, might result in fewer disruptions to legitimate transactions but may miss some anomalies. Depending on the specific goals and risk tolerance of the anomaly detection system, either model, or a combination of both, could be employed to enhance the overall detection capability.